Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistently report all dimensions in error messages if invalid dimensions are given #8079

Merged
merged 13 commits into from
Sep 9, 2023

Conversation

mgunyho
Copy link
Contributor

@mgunyho mgunyho commented Aug 17, 2023

Hello,

I noticed that arr.min("nonexistent") raises an error with a very helpful message

ValueError: 'nonexistent' not found in array dimensions ('x', 'y', 'z')

while arr.idxmin("nonexistent") raises

KeyError: 'Dimension "nonexistent" not in dimension' [sic]

IMO, the list of dimensions should always be shown in the error message for these kinds of errors, it makes debugging much easier. With this PR, I have implemented this behavior for all such functions that I could find.

There is quite a consistent pattern which I think could be factored out into a function, but I didn't have a clear enough picture of the structure of the whole code to do it.

I didn't fix the tests yet, I'll do it if you think this can be merged.

  • Searched list of issues, couldn't find one related to this
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@mgunyho mgunyho marked this pull request as draft August 17, 2023 16:18
@TomNicholas
Copy link
Member

This is great, thank you @mgunyho !

@kmuehlbauer kmuehlbauer marked this pull request as ready for review August 17, 2023 19:03
@kmuehlbauer
Copy link
Contributor

Not sure, but what to do if we have say tens or even hundreds of dimensions? Maybe that's not the majority of use cases but we should be prepared.

BTW, I've marked this as ready for review by accident, sorry.

@mgunyho
Copy link
Contributor Author

mgunyho commented Aug 18, 2023

what to do if we have say tens or even hundreds of dimensions?

The maximum number of dimensions for a numpy array is 32, and seems like in the near future it's going to be increased to 64 at most: numpy/numpy#5744. The same limit seems to apply for dask.

But okay, a dataset where each variable has a different coordinate can have lots of dimensions, like #5546. Although there it is also mentioned that in reality usually the number of dimensions is on the order of 10.

IMO a 32-item list in the error message is a bit ugly but still acceptable. If a use case comes up where this is a problem, we could add logic to limit the number of items shown in the list (this would be a good reason to factor out a function).

I'll fix the tests soon and then mark this as ready for review for real.

@mgunyho mgunyho marked this pull request as draft August 18, 2023 08:40
@kmuehlbauer
Copy link
Contributor

@mgunyho I totally agree with your reasoning, but I just wanted to mention it as a possible problem source. Thanks for taking care!

@mgunyho mgunyho force-pushed the report-dims-in-errors branch 2 times, most recently from acd1e31 to 96d8f68 Compare August 19, 2023 12:03
@mgunyho mgunyho marked this pull request as ready for review August 19, 2023 13:18
@mgunyho
Copy link
Contributor Author

mgunyho commented Aug 19, 2023

I now went through all relevant ValueErrors and KeyErrors and updated the error messages where applicable.

I left out a couple of instances involving data_vars, because it's more likely have lots of them (like in #5546).

For Dataset, {dataset.dims!r} shows up as Frozen({"dim1": 3, "dim2": 4}), so I used tuple(dataset.dims). I used tuple instead of list because that's what we have done earlier, see here and here. Personally I would maybe prefer list, because a single-element tuple ("dim",) looks a bit confusing compared to a list ["dim"].

Note that arr.rolling(nonexistent=3) and arr.idxmin("nonexistent") raise KeyError, while coarsen() and min() raise ValueError. I also found MissingDimensionsError in variables.py which is a subclass of ValueError, but that's only used by one function.

doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/tests/test_dataset.py Outdated Show resolved Hide resolved
with pytest.raises(
ValueError,
match=re.escape(
"Dimensions ('space',) not found in dataset dimensions ('time',)"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now says "dataset" even though it's a DataArray, because DataArray.drop_duplicates just does self._to_temp_dataset().drop_duplicates(dim, keep=keep). Maybe the error message could say "data dimensions"? Should I change it everywhere to say just data instead of dataset?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think not mentioning Dataset would be better. There might be an example somewhere else of handling this same type of ambiguity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall seeing an error message saying "data dimensions", and I found this: https://github.com/pydata/xarray/blob/main/xarray/core/variable.py#L677 (seems pretty closely related to this PR and #8089). I can change the wording to say just "data dimensions", I think it's a good way to put it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is "{self.__class__.__name__} dimensions", which is used in a couple of places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it now to "data dimensions".

@mgunyho
Copy link
Contributor Author

mgunyho commented Aug 19, 2023

Also, we should probably add the "error-reporting" label to this PR, I can't seem to be able to do it

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would maybe prefer list

Given that we are not trying to communicate anything about the type (i.e. regardless of the error raised the dimensions are not stored as a list internally) then I think you can print them however you feel is neatest.

xarray/core/computation.py Show resolved Hide resolved
"some variables in data_vars are not data variables "
f"on the first dataset: {invalid_vars}"
f"the variables {invalid_vars} in data_vars are not "
f"found in the data variables of the first dataset"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"found in the data variables of the first dataset"
f"found in the data variables of the first dataset {valid_vars}"

Copy link
Contributor Author

@mgunyho mgunyho Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I explicitly left out listing the data variables, because it's more likely to have many of them like in #5546 (see also the comment in the code right above this)

@mgunyho
Copy link
Contributor Author

mgunyho commented Aug 21, 2023

Personally I would maybe prefer list

Given that we are not trying to communicate anything about the type (i.e. regardless of the error raised the dimensions are not stored as a list internally) then I think you can print them however you feel is neatest.

Can you @Illviljan say why you preferred tuple over list here #7821 (comment) ?

@Illviljan
Copy link
Contributor

I don't remember that suggestion but maybe here's the reason:

  • tuples are faster than list to initialize, if mutability is not needed then use tuple.
  • dims are usually returned as Tuple[Hashable, ...], see DataArray().dims

@mgunyho
Copy link
Contributor Author

mgunyho commented Aug 26, 2023

tuples are faster than list to initialize

I suppose here this doesn't make much of a difference, since the number of items is fairly small.

If mutability is not needed then use tuple, dims are usually returned as Tuple[Hashable, ...],

These are fair points. I've left it as tuple for now.

@dcherian dcherian added the plan to merge Final call for comments label Sep 8, 2023
@dcherian
Copy link
Contributor

dcherian commented Sep 8, 2023

Thanks @mgunyho this is a great improvement. Thank you!

@dcherian
Copy link
Contributor

dcherian commented Sep 9, 2023

Note that arr.rolling(nonexistent=3) and arr.idxmin("nonexistent") raise KeyError, while coarsen() and min() raise ValueError. I also found MissingDimensionsError in variables.py which is a subclass of ValueError, but that's only used by one function.

I think we should harmonize all these to ValueError but lets do that in a new issue & PR

@dcherian dcherian changed the title Consistently report all data dimensions in error messages if invalid dimensions are given Consistently report all dimensions in error messages if invalid dimensions are given Sep 9, 2023
@dcherian dcherian merged commit 0afbd45 into pydata:main Sep 9, 2023
25 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants